ABSTRACT



Data to be analyzed is transferred from one or more user 
systems to a host system, which includes an analysis/ 
decision support module. Queries are generated, either auto- 
matically by the analysis/decision support module, or by the 
user, who then submits them to the host system. More than 
one user may participate in the system, including transfer- 
ring data to the host. This joint participation includes the 
option of collaboratively submitting or adjusting queries and 
viewing the results of the data analysis, either in real time, 
or asynchronously. Data used as the basis of an analysis may 
therefore come from different entities, even from data bases 
that are available publicly via the network, but whose 
owners are not participants in the collaborative, hosted
analysis system according to the invention. The host system 
thus acts as a network portal through which different users 
may store and share not only data for analysis, but also the 
results of such analysis. 

6 Claims, 7 Drawing Sheets

SYSTEM AND METHOD FOR
COLLABORATIVE HOSTED ANALYSIS OF
DATA BASES VIA A NETWORK PORTAL

CROSS-REFERENCE TO RELATED 
APPLICATIONS 

This is a Continuation-in-Part of and claims priority from
pending U.S. patent application Ser. No. 09/479,194, filed
Jan. 7, 2000, which is a Continuation-in-Part of U.S. patent
application Ser. No. 08/850,828, filed on May 2, 1997, now
U.S. Pat. No. 6,014,661, which claims benefit of Provisional
Application No. 60/019,049, filed May 6, 1996.

BACKGROUND OF THE INVENTION

1. Field of the Invention 

This invention relates to a method and system for access- 
ing and automatically analyzing data in one or more data 
bases and for allowing at least one user to selectively view 
the results of the data analysis based on interactive queries. 

2. Description of the Related Art 

At present, when a user wishes to analyze the data in a 
data base, he faces the tedious task of entering a series of 
search parameters via a screen of input parameters. At times, 
the various queries must be linked using Boolean operators, 
and changing one parameter or operator may often neces- 
sitate changing many other less crucial parameters so as to 
keep them within the logical range of the input data set. 
Similar difficulties are now also arising when a user or a
search engine scans many Internet sites to match certain 
criteria. 

Furthermore, the concept of "analyzing" the data in a data 
base usually entails determining and examining the strength 
of relationships between one or more independent data 
characteristics and the remaining characteristics. This, in 
turn, leads to an additional difficulty: one must decide what
is meant by the "strength" of a relationship and how to go about
measuring this strength. Often, however, the user does not or
cannot know in advance what the best measure is. 

One common measure of relational strength is statistical 
correlation as determined using linear regression techniques.
This relieves the user of the responsibility for deciding on a 
measure, but it also restricts the usefulness of the analysis to 
data that happens to fit the assumptions inherent in the linear 
regression technique itself. The relational information pro- 
vided by linear regression is, for example, often worse than 
useless for a bi-modal distribution (for example, with many 
data points at the "high" and "low" ends of a scale, but with 
few in the "middle") since any relationship indicated will 
not be valid and may mislead the user. 

Another problem with existing data base analysis systems 
is that they are in general centralized, meaning that the data 
bases, the query and analysis engine, and the display system 
are all contained within the same general system, at the same 
site. This means that a user with a large data set but no
powerful analysis engine must first find and install the
engine before being able to study the data set. Along with
such a standard solution to the problem comes the need to 
maintain the software. This solution is particularly ineffi- 
cient when there is no on-going need to analyze the stored 
data. Moreover, if the user wants to analyze data in a data 
base not at his own site, but rather in a remote, possibly 
publicly available data base, then he would either have to 
hope that the remote site has proper data analysis software, 
or else he would have to acquire the data set and study it at 
a site that has the proper software analysis tools. This would
be unwieldy at best and possibly impossible if the remote
data base is very far away, or is distributed among different 
sites, or has a data set so large that importation into the 
user's own analysis system is impractical. 

Yet another problem arises where two or more users
wish to be able to share not only data, but also the ability to
analyze it, and then perhaps even share the results with still
other entities. If only one entity has the ability to analyze the
data, then it will be difficult or impossible to allow others to
help direct or otherwise participate in the analysis or its
results. This makes it hard for different users in a single 
company to most efficiently develop and share results of 
analysis of data, especially when the users are at different 
physical sites. For example, researchers working in a large
pharmaceutical corporation, as well as data they collect, are
often located at facilities far away from each other. 

What is needed is a system that can take an input data set, 
select suitable (but user-changeable), software-generated 
query devices, and display the data in a way that allows the 
user to easily see and interactively explore potential rela- 
tionships within the data set. The query system should also 
be dynamic such that it allows a user to select a parameter
or data characteristic of interest and then automatically
determines the relationship of the selected parameter with
the remaining parameters. Moreover, the system should
automatically adjust the display so that the data is presented
logically consistently. 

The system should preferably make it possible for a user 
either to analyze remote data sets, or to analyze local data 
sets without needing to acquire and install specialized analy- 
sis software, or both. It should preferably still be possible to 
analyze local data bases even though they may be installed 
behind a so-called "firewall." 

It should also be not only possible but easy for users even
at different locations to be able to access each other's data, 
and preferably to incorporate even other data into their 
analysis. Ideally, the participants in the analysis system 
should not have to be within the same organization; rather, 
it should be possible for people to collaborate in and share 
the results of data analysis even in the context of an 
extended/virtual enterprise, in which the participants may be 
spread across multiple organizations, and across multiple 
sites. As just one example, the system should easily
accommodate a research project involving a collaboration of
research efforts by a pharmaceutical company, a biotechnol- 
ogy company, and a university research institution. It should 
be possible to readily share not only data, but even the
results of the analysis of the data, such as visualizations,
reports, computations, etc., preferably even with e-mail
notification. This invention makes this possible. 

SUMMARY OF THE INVENTION 

The invention provides a method and a related system for
processing data from at least one data base. The main steps
of the method according to the invention are: 1) transferring
to a host system, via a network such as the Internet, from at
least one participating user system other than the host
system, the data from the data base(s); 2) in the host system,
analyzing the data from each data base according to an
analysis routine and then generating analysis results; 3) in
the host system, generating a representation of the analysis
results; and 4) transferring the representation of the analysis
results via the network for display on at least one
participating user system.
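
As a rough illustration of the four steps above, the following
Python sketch models the host system as an in-memory object. The
class name `HostSystem`, its method names, and the per-field mean
used as the analysis routine are all assumptions made for this
example; they do not come from the specification, and the network
transfer itself is omitted.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class HostSystem:
    """Hypothetical in-memory stand-in for the host system."""

    # Data received from each participating user system, keyed by user.
    regions: dict = field(default_factory=dict)

    def receive(self, user_id, records):
        # Step 1: data from a user's data base arrives at the host.
        self.regions.setdefault(user_id, []).extend(records)

    def analyze(self, field_name):
        # Step 2: placeholder analysis routine (mean of one field per user).
        return {
            user: mean(r[field_name] for r in records)
            for user, records in self.regions.items()
            if records
        }

    def represent(self, results):
        # Step 3: build a displayable representation of the results.
        return [f"{user}: {value:.2f}" for user, value in sorted(results.items())]

    def publish(self, field_name, recipients):
        # Step 4: hand the same representation to each recipient system.
        view = self.represent(self.analyze(field_name))
        return {user: view for user in recipients}
```

Note that a recipient in `publish` does not have to be one of the
systems that supplied data, which mirrors the idea that results can
be shared with other participating users.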

In the preferred embodiment of the invention, a memory
region is allocated in the host system for each participating
user system. Each memory region stores data from each data
base transferred via the network from each respective
participating user system to the host system. Each memory
region may also store at least address information indicating
the location of the transferred data within the host system.
The address information may include, for example, a network
address of at least one external data base that is
accessible for downloading from a non-participating computer
system that is connected to the network. In this case,
the host system accesses each such external data base
via the network and then downloads the external data base
data into a memory of the host system. Even when the data
from the data base(s) is transferred from one participating
user source system, the representation of the analysis results
may be transferred to the participating user systems other
than the participating user source system.

The invention may operate with data base data stored or
arranged according to any known data structure. In the
preferred embodiment of the invention, however, the data
base data is structured into records, each record having one
or more fields. Each field contains field data, has a field
name and one of a plurality of data types. Given this data
structure, a decision support module in the host system
according to the invention then automatically selects an
initial, adjustable, graphical query device as a function of
and adapted to a type and range of the corresponding field
data. Each graphical query device is then transferred via the
network to at least one participating user system. The host
system then senses, via the network, adjustment by the user
of each participating user system to which each graphical
query device has been transferred of any of the displayed,
adjustable, graphical query devices. The host system then
updates the representation of the analysis results corresponding
to the sensed adjustments of any of the query devices,
thereby enabling interactive visualization of the analysis
results of the data via the network. At least one of the user
systems to which graphical query devices are transferred
may be one of the participating user systems other than the
source user system.

A log may be maintained, preferably in the user-associated,
allocated memory regions, of accesses to the data
stored in the respective memory regions. The host system
may then notify, via the network, each user whose corresponding
data, stored in the respective memory region, is
accessed by any other participating user.

The decision support/analysis module in the host system
may implement any known data analysis routine. In the case
where each data base contains a plurality of records and each
record includes a plurality of data fields, however, the
decision support module may analyze the data from the data
base(s) by automatically detecting a relational structure
between the data fields by calculating a respective relevance
measure for each of the data fields. The relevance measure
is preferably a data type-dependent function indicating a
measure of relational closeness to at least one other of the
data fields. The host system then generates a graphical
representation of the relational structure and transfers this
graphical representation via the network for display on at
least one participating user system.

Results of the data analysis may be generated and presented
in many different forms, such as on-screen
visualizations, reports, computations, etc. User systems then
communicate with the host system, preferably via a publicly
accessible network such as the Internet, or via a proprietary
network such as are found within some enterprises, in many
cases via a browser. Data stored not only in the user space,
but, optionally, even imported from external data bases
connected to the network, may then be analyzed in the
central host. Users may view the results of the analysis,
change parameters, and thus interactively analyze the data,
but may optionally do so collaboratively, and either in real
time, or asynchronously. Other users may add or remove
data from the analysis, or change the viewing parameters,
based on the same initial data set; the system then allows
[illegible] relationships in the data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the main components of an analysis
system according to the invention for retrieving and
displaying data from a data base.

FIG. 2 illustrates examples of device queries that can be
used in the invention's interactive display.

FIG. 3 illustrates the main processing steps the invention
follows for one example of the use of the invention to
visualize relationships between data in a data base.

FIG. 4 illustrates a decision tree, which is one method that
can be used in the analysis system according to the invention
in order to determine and define the structural relationship
between different data fields in a data base.

FIG. 5 illustrates a display of the results of a data base
analysis using the analysis system according to the invention.

FIG. 6 is a block diagram that illustrates a system according
to the invention in which the analysis system is remotely
accessible via a network so that it can be used to analyze data
in a data base at a user's site, or at a third-party site
accessible via the network, or both.

FIG. 7 is a block diagram of a remote-hosting embodiment
of the invention, in which the analysis system is
centrally hosted and remotely accessible via a network to
any number of users. Data to be analyzed is either stored
within a user memory space in the central host, or it may be
imported via network links, or both.

DETAILED DESCRIPTION

The invention is well suited for interactive visualization
and analysis of data from any type of data base. Just a few
of the thousands of possible uses of the invention are the
visualization and analysis of financial data, marketing data,
demographic data, experimental data, environmental data,
logistics data, World-Wide Web log files, manufacturing
data, biostatistics, geographic data, and telephone traffic/
usage data.

The invention includes a data analysis module or
"engine," and various embodiments, each with a different
system configuration in which the location of the data
analysis engine, of the various user systems, and of the data
to be used as the basis for the analysis differ. These are
described in turn. The preferred method and system for
automatically analyzing data are described first.

The main components of the simplest configuration of the
system according to the invention are illustrated in FIG. 1.
A data base or data set 100 (one or more) may be stored in
any conventional devices such as magnetic or optical disks
or tapes and semi-conductor memory devices. The size of
the data set may be arbitrary, as the invention has no inherent
limitations in the size of the data base it can access and
analyze.

A main processing system 110 may be implemented using
a microprocessor, a mini- or mainframe computer, or even a
plurality of such processors operating in parallel or as a
pipeline. The processing configuration may be chosen using
normal design techniques based on the size of the largest
data set one expects to have to process, and on the required 
processing times. The processing system 110 includes, 
among other sub-systems, a sufficiently large memory 112 to
store all data used in the data classification and display 
procedures described below. The processing system 110 
forms the analysis system that enables a user to query one or 
more data bases and view results of the data classification 
according to the user's queries. 

The data set 100 (that is, its storage device) is connected 
for communication with the main processing system by 
means of any conventional channel 114, which may be a 
dedicated channel, or a general-purpose channel such as 
telephone lines (for example, for connection through a 
network, including the Internet), fiber-optic or high-
bandwidth metal cables, or radio or micro-wave links. The
size of the data set and the desired processing speed will in 
general determine which channel is appropriate and fast 
enough. The data set need not be remote from the processing 
system, although this is the case in whole or in part in some
embodiments of the invention, which are described and 
illustrated below. Rather, the data set's storage device 100 
may even be a peripheral memory device connected directly 
to the processing system 110. 

In most applications of the invention, the user will wish 
to see a graphical display of some feature of the data set. 
This is not, however, necessary: the invention may be used
as a sub-system that queries the data base 100 and organizes
the data for a supervisory routine, which then processes the
data automatically in some other way. For example, the 
invention may be used in a system that automatically 
generates lists of potential customers of a product chosen 
from a large data base of consumer information. In the 
typical case, however, the results of the data processing 
using the invention are to be interactively displayed and to
that end, a display unit 120 is preferably connected to the
main processing system 110. The display unit may be a
standard device such as a computer monitor or other CRT, 
LCD, plasma or other display screen. Standard display 
drivers (not shown) are included in the display unit 120 and 
are connected to the processing system 110 in any conven- 
tional manner. 

A conventional input system 130 is also connected to the 
main processing system in the normal case in which the user
is to select initial data classification parameters. The input
system may consist of a single standard positional input
device such as a mouse or a trackball, or an alphanumeric 
input device such as a keyboard, but will normally include 
both types of devices. The display unit 120 itself may also 
form part of the input device 130 by providing it with 
standard touch-screen technology. The connection and inter- 
face circuitry between the input system 130 and display unit 
120 on the one hand and the processing system 110 on the 
other hand may be implemented using standard components 
and is therefore not described further here. 

The main procedural steps carried out by the invention are 
as follows: 

1) The main processing system 110 accesses the data base 
100 in any known manner and exchanges standard protocol 
information. This information will normally include data 
indicating the size of the stored data set as well as its record
and field structure. 

2) The processing system then downloads records and 
classifies them by type (also known as attribute). Some 
examples of the many different possible types of data
include integers, floating-point numbers, alphanumeric
characters, and special characters or codes, lists and strings,
Boolean codes, times and dates, and so on. In a data base of
films, for example, each record may have data concerning 
the title and the director's name (alphanumeric attribute), the 
release year (an integer), whether the film is a comedy,
drama, documentary, action film, science fiction, etc. 
(marked in the data base as an integer or alphanumeric 
code), whether the film won an Academy Award (logical) or 
how many it won (integer). As is described in greater detail 
below, the system according to the invention preferably 
automatically type-classifies the various fields in the data 
base records. In certain cases, however, the data base itself 
(in the initial protocol and structural information) may also 
indicate the types of the various fields of the records; in such 
case, the processing system may not need to type-classify
the fields and can omit this step. 
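
A minimal sketch of how such automatic type classification might
look is given below, assuming every field arrives as a raw string.
The function names, the type labels, the set of strings treated as
logical values, and the `%Y-%m-%d` date format are assumptions of
this example, not part of the specification.

```python
from datetime import datetime


def _parses(convert, text):
    """Return True if convert(text) succeeds without a ValueError."""
    try:
        convert(text)
        return True
    except ValueError:
        return False


def _is_date(text, fmt="%Y-%m-%d"):
    # Assumed date layout; a real system would try several formats.
    return _parses(lambda t: datetime.strptime(t, fmt), text)


def classify_field(values):
    """Return a type label for one field, given its raw string values."""
    values = [v.strip() for v in values if v.strip()]
    if not values:
        return "empty"
    # Fields holding only two-state markers are treated as logical.
    if {v.lower() for v in values} <= {"0", "1", "true", "false", "yes", "no"}:
        return "logical"
    if all(_parses(int, v) for v in values):
        return "integer"
    if all(_parses(float, v) for v in values):
        return "floating-point"
    if all(_is_date(v) for v in values):
        return "date"
    return "alphanumeric"


def classify_records(records):
    """Classify every field of a non-empty list of dict records."""
    return {
        name: classify_field([r[name] for r in records])
        for name in records[0]
    }
```

The order of the checks matters: a field containing only "0" and "1"
is reported as logical before the integer test is reached, which is
one possible reading of the two-valued case and may need to be
revisited once more records are seen.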

3) For each record set that the processor has classified, it 
then (or simultaneously) determines the range of the data for
each field in the set. This can be done in any of several 
standard ways, and different methods may be used for 
different data types. For numerical data sets, the system may 
simply search through the set to determine the maximum 
and minimum values as well as (if needed), the average or 
median values to aid in later centering and scaling of a 
corresponding displayed query device. The system preferably
also counts the number of different values in each set
of fields in all of the records in the data set. Ranges may also
be predetermined; for example, if the user wishes to include
in the data base search data records sorted alphabetically by 
surnames of Americans or Britons, then the range of first 
letters will be no greater than A-Z (a range count of 26), 
although a search of the actual records in the data base might 
show that the range of, say, A-W is sufficient (with a range 
count of 23). For names (or other text) in other languages, the
alphabetical range and range count may be either greater or
smaller; for example, Swedish text could begin on any of 29
different letters (A-Z, Å, Ä, Ö).
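
The range determination of step 3 can be sketched as follows. The
function names and the returned dictionary keys are hypothetical,
and the alphabetical helper assumes plain A-Z initials only, so it
would not handle the Swedish example without a locale-aware
ordering.

```python
from statistics import median


def numeric_range(values):
    """Minimum, maximum, median and distinct-value count of a numeric field."""
    return {
        "min": min(values),
        "max": max(values),
        "median": median(values),
        "distinct": len(set(values)),
    }


def first_letter_range(names):
    """Range of initial letters actually present in a text field (A-Z only)."""
    initials = sorted({name[0].upper() for name in names if name})
    low, high = initials[0], initials[-1]
    return {
        "low": low,
        "high": high,
        # Width of the span from low to high, e.g. A-W gives 23.
        "range_count": ord(high) - ord(low) + 1,
        # How many different initials occur within that span.
        "distinct": len(initials),
    }
```

For surnames whose initials run from A to W, `range_count` is 23,
matching the count given in the text, while `distinct` reports how
many of those letters are actually used.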

4) The system then analyzes the relational structure of the
data records using any or all of a plurality of methods. These
methods include regression, decision trees, neural networks,
fuzzy logic, and so on. According to the invention, the
system preferably applies more than one method to determine
the structure of the data and then either selects the
"best" method in some predetermined sense, or else it
presents the results of the different structural determinations
to the user, who may then select one that appears to give
the best result.
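
One way to picture "applying more than one method and selecting the
best" is sketched below. As a simplifying assumption, only two
methods are compared (a least-squares line and a one-split decision
tree, often called a stump), and "best" is taken to mean the larger
fraction of variance explained on the data at hand. Both the choice
of methods and this criterion are illustrative; the specification
leaves the criterion open.

```python
from statistics import mean


def _sum_sq(ys):
    """Sum of squared deviations of ys from their mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def linear_r2(xs, ys):
    """Fraction of variance in ys explained by a least-squares line on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    total = _sum_sq(ys)
    if sxx == 0 or total == 0:
        return 0.0
    return (sxy ** 2) / (sxx * total)


def stump_r2(xs, ys):
    """Fraction of variance explained by the best single threshold on xs."""
    total = _sum_sq(ys)
    if total == 0:
        return 0.0
    pairs = sorted(zip(xs, ys))
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        residual = _sum_sq(left) + _sum_sq(right)
        best = max(best, 1.0 - residual / total)
    return best


METHODS = {"regression": linear_r2, "decision stump": stump_r2}


def best_method(xs, ys):
    """Score every method and return the winner together with all scores."""
    scores = {name: fn(xs, ys) for name, fn in METHODS.items()}
    return max(scores, key=scores.get), scores
```

Returning all scores alongside the winner also supports the
alternative described above, in which the results of the different
determinations are shown to the user, who then chooses. Comparing
in-sample fit across methods of different flexibility is a
simplification; a held-out sample would be a fairer basis.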

5) Once the system has determined the data field types and
ranges, the system determines a user interface to be displayed
on the display unit 120. The results of the structural
relational analysis are also preferably used to order the
various query devices that are displayed to the user to give
him guidance in finding the strongest relationships among
the various fields of the data base. The interface preferably
automatically selects (at least initially; later, the system
automatically presents alternatives to the user, from which
he may select) the lay-out of query devices (described
below), coordinate axes (either automatically or under user
control) and scales, display colors and shapes, the degree of
"zoom" of the display (if needed), or other features depending
on the particular application and user preferences.
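
The interface selection of step 5 might be sketched as below: each
field receives an initial query device chosen from its type and
range, and the devices are ordered by the strength found in the
structural analysis. The selection rules, the threshold of five
distinct values, and the field dictionary keys (`name`, `type`,
`distinct`, `relevance`) are assumptions for illustration only.

```python
def choose_query_device(field_type, distinct_count):
    """Pick an initial query device for a field (illustrative rules)."""
    if field_type == "logical":
        return "toggle"
    if field_type in ("integer", "floating-point", "date"):
        return "rangeslider"
    if distinct_count <= 5:
        # Few alternatives: one toggle per value, grouped as a checkbox.
        return "checkbox"
    return "alphaslider"


def layout(fields):
    """Order fields by descending relevance and attach a device to each."""
    ranked = sorted(fields, key=lambda f: f["relevance"], reverse=True)
    return [
        (f["name"], choose_query_device(f["type"], f["distinct"]))
        for f in ranked
    ]
```

Because the result is only an initial proposal, a user interface
built on it would still let the user swap any device for an
alternative.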

6) In many cases, there will be so many records in the data 
base that it would take too long for the system to search 
through all records in the data base to determine the record
type or range. The invention therefore preferably includes
the procedural steps of first determining the number of
records in the data base, and, if the number of records
exceeds a predetermined threshold, sampling the data set to
determine the record type and to extrapolate its range from
the range of the sample. Conventional techniques may be
used to determine the number of records and their type;
indeed, in many cases, the number of records is a parameter
included in the data base itself. The threshold for sampling
may also be determined using conventional design criteria
and will include such factors as the available time allowed
for data transfer and processing, which will in part be
determined by the speed of the chosen processor.

Different sampling techniques may be used. For example,
every n'th record can be examined (where n is determined
by what percentage of the records the system can examine
in a given time); or a predetermined percentage of the
records can be selected randomly; or records may be
sampled randomly until a predetermined statistical significance
has been achieved, etc. Once the sampling process has
been completed, the entire data set can be downloaded and
processed for display using the type and range classifications
of the sample.

7) In a remote processing embodiment of the invention,
the data to be classified, analyzed, and displayed is located
at a local user's site, or in a data base that is accessible via
a network such as the Internet, or both, but the data is
accessed and processed as above at a remote site.

As FIG. 1 illustrates, the steps of type detection, range
determination, structure identification, interface selection
and sampling may be carried out in dedicated processing
sub-systems 140, 142, 144, 146 and 148, respectively. Note
that many of the processing steps described above (for
example, type/range determination and interface selection)
can be carried out in parallel as well as in series. Assuming
the chosen processing configuration is fast enough, however,
all of these steps, or any combination of steps, may be
carried out by the same processor.

In most conventional systems, data base searches are
interactive only in the sense that the user's initial "guess"
(search profile) can be modified and re-submitted; the user
is given little or no guidance or indication of the size and
range of the data involved in the various possible choices for
the search profile. As such, the user might, for example,
initially submit a search profile with no possible "matches."
By initially analyzing the data set to determine data types
(attributes) and ranges, the invention is able to create an
initial query environment that allows the user to avoid such
wastes of time. Even further time-saving procedures unique
to the invention are described below.

FIG. 2 illustrates examples of dynamic query devices that
the processing system may generate on the screen of the
display unit 120, depending on the type and range of the
data. The various data query devices are generated and
located for display using any known software, such as is
readily available for writing display software for Microsoft
Windows applications or similar software packages. One of
the most useful query devices is the slider, which may be
either a single-slider query device 200 for indicating single
alphabetical or numerical characters, or the rangeslider
query device 210 for indicating ranges of alphabetical or
numerical characters; two-dimensional single and range
sliders may also be used.

In FIG. 2, the attribute of the data field associated with the
single slider 200 is alphabetical. One sees in this example
that the data in the indicated field has relatively very few
"A" entries, relatively many "B" entries, few or possibly no
"J" entries, many "S" entries, and so on. The user can also
see that there are no "X", "Y", or "Z" entries; the upper
alphabetical limit "W" will have been determined during the
range detection step. In the illustrated example, the user has
manipulated a standard screen cursor 220 (for example,
using a trackball or mouse included in the input system 130)
to move the slider 230 approximately to the right-most range
of the "B" entries. In other words, the user is requesting the
system to find data records for which the corresponding data
set records start with a "B".

The "scale" of the alphaslider need not be alphabetical;
rather, instead of the letters "A", "B", ..., "W" and so on,
the system could display numbers, one of which the user is
to select by "touching" and moving the slider 230. The user
may move the slider 230, for example, by placing the tip of
the cursor 220 on it and holding down a standard mouse
button while moving the mouse to the left or right, releasing
the button when the slider is at the desired value.

The illustrated range slider query device 210 has a scale
the same as the single-valued slider, but, as its name
indicates, is used to select a range of values. To do so, the
user "touches" either the left range slider 240 or the right
range slider 242 and moves it as with the slider 230. In the
illustrated example, the user has selected a query such that
only those relations should be displayed for which the
chosen attribute has a value between about 13 and about 72.
Excluded values are here displayed "shaded" on the slider
query device.

Many variations of the illustrated sliders may be used in
the invention, such as those that indicate which values are
not to be included (for example, by "clicking" on an
appropriate portion of the slider display to indicate by
shading that the logical complement of that portion of the
range is to be applied), that indicate ranges inclusive at one
extreme but exclusive at the other (for example, by clicking
on the range marker to toggle it to different logical states),
and so on. A more complete discussion of the possibilities is
given in the inventors' papers "Exploring Terra Incognita in
the Design Space of Query Devices," C. Ahlberg & S. Truvé,
Dept. of Computer Science and SSKKII, Chalmers University
of Technology, Göteborg, Sweden; the article "The
Alphaslider: A Compact and Rapid Selector," C. Ahlberg &
B. Shneiderman, Proceedings, ACM SIGCHI '94, Apr.
24-28, 1994; and "Dynamic Queries for Information Exploration:
An Implementation and Evaluation," ACM SIGCHI
'92, May 3-7, 1992.

The illustrated example also shows a toggle 250 on which
the user has "clicked" (for example, in the standard way, by
"touching" the toggle with the cursor on the display screen
and pressing a mouse button) to indicate that the feature "Y"
should be present in the displayed data. That the toggle is
"on" may, for example, be indicated by the processing
system by displaying it darker, by superimposing a cross
("X") on it, or in some other conventional way.

A checkbox 260 contains more than one toggle. In the
illustrated example, features B and C have been selected for
inclusion as a data query, whereas A and B have not. A
displayed dial 270 is yet another example of a query device.
Using the cursor, the user pulls the pointer 272 clockwise or
counter-clockwise and the system displays the value (in the
example, "73") to which the pointer is currently pointing.
Other query devices may be used, for example, pull-down
menus and two-dimensional sliders (for example, one on an
x-axis and another on a y-axis).

FIG. 3 is a block diagram that shows not only the main
paths of data flow in the invention, but also is a more
detailed functional block diagram of the system shown in
FIG. 1. Reference numbers for the functional blocks are the
same as those in FIG. 1. Furthermore, the sampling sub-system
148 has been omitted from FIG. 3 since its operation
is described above and since it will simply reduce the
number of data records initially passed on to the type
detection, range detection and possibly the structure
detection sub-systems described below.

As is mentioned above, the invention can be used for
many different types of data bases and data base structures.
Merely by way of example, however, assume that the user
wishes to analyze the possible relationships found between
various items in a data base of customer purchases for a
chain of stores that sell clothing, shoes, and cosmetics. Such
data might be compiled automatically, for example, for all
customers who use the store's own credit card. This situation
is illustrated in FIG. 3.

As is common, the data base 100 is organized as a series
of records R1, R2, ..., Rm, each of which has a number of
fields F1, F2, ..., Fn. In this example, there are ten fields
per record (here, n=10, but the number of fields per record in
actual data bases may of course be greater or less than
ten; the invention is not dependent on any particular number
of records or fields). The fields (F1, F2, ..., F10) in the
example are: F1) an identification code for the customer
associated with the record; F2), F3) and F4) the customer's
name, age and sex, respectively; F5) the total amount the
customer has spent (during some predetermined period);
F6), F7), and F8) the amount the customer has spent on
clothing, shoes and food, respectively, during this period;
F9) the date of the customer's most recent purchase; and
F10) a number representing how frequently the customer
makes purchases (for example, measured in transactions per
time period).

The illustrated data base also includes standard protocol
data as well as field names associated with the different
fields. The protocol data will typically include data indicating
the total number of bytes (or words) the data base
contains, how many records, how many fields per record,
and how many bytes (or words) each field consists of. If the
protocol is already standardized or otherwise pre-determined
between the data base 100 and the main processing
system, then there will be no need for the protocol
fields. Moreover, the field name data will not be necessary
if it is already established in some other conventional
manner for the user or the main processing system of the
invention what data the various record fields represent.

In the preferred embodiment of the invention, the main
processing system 110 automatically detects the type of data
in each of the record fields, unless the data types are already
specified by the data base in the protocol or field names data.
This may be accomplished using any known data type
detection routine, as long as the number of records in the
data base is large enough to allow the detection routine to
make statistically relevant deductions about the data. For
example, in order to detect the type of data in field Fk (k=1,
2, ..., m), the processing system may access (that is,
download in bulk, read in and process sequentially, etc.) all
of the field data (R1, Fk), (R2, Fk), ..., (Rm, Fk), where (Rj, Fk)
indicates the k'th field of the j'th record. Any of many
different known tests may then be applied to determine the
data type.

For example, if all (or more than a pre-defined
percentage) of the bytes of all of the fields Fi (that is, field
Fi in all of the records) correspond to binary numbers from
97-122 and 65-90, then the system may assume the field
contains data with an alphabetical (string) attribute (type),
since these are the ranges of the ASCII codes for the
English-language alphabet (lower and upper case,
respectively). In the example shown in FIG. 3, this would be
the case for F2: Name. If only two different values are
detected (especially, 00000000 and 00000001), then the
system may, for example, assume that the corresponding
field contains Boolean data or, if the two values also fall
within the ASCII alphabetical range, then the system may
instead (or, temporarily, in addition) mark the field as an
alphabetical field. Field F4 might thus be either Boolean (if
the field name is "Woman?" then F4 might indicate either
"yes" or "no" with binary numbers 1 and 0) or a single-element
string alphabetical (F for female, M for male).
Using known methods, the system will similarly distinguish
between integers and floating-point numbers, often by a
knowledge of the field structure itself from the protocol
data: integers are typically represented by single data
words, whereas floating-point numbers will typically require
two separate data words for the whole-number and decimal
portions. Indications of the data types are then preferably
stored as a data type table 340 in the
memory 112. In FIG. 3, field Fk has been identified as
having data type Tk (k=1, 2, ..., n).

The range detection sub-system 142 determines
the range of each field. For numerical fields, for
example, this will typically be the maximum and minimum
values. For string data, this will typically be the
letters that occur first and last alphabetically. The number of
different values is preferably also accumulated for each field.
Additional data may also be tabulated as desired or needed.
For string data, for example, for each string field, the system
may compile a table of the number of times
each letter of the alphabet occurs first in the field in order to
reduce clutter in the later display by eliminating
non-occurring letters. The median of the occurrence table may
then be calculated and used for later centering of the scale
of the associated query device (see below). For numerical
fields, the range detector 142 may additionally calculate
such statistical range data as the mean, median, and standard
deviation of the field data. All calculated range data is then
preferably stored in a data range table 342 in the memory
112. In FIG. 3, field Fk has been identified as having range
(k=1, 2, ..., n).

As is mentioned above, the type- and range-detection
sub-systems 140, 142 may operate either in series or in
parallel. Even with a single processor implementing both
sub-systems, these two sub-systems may operate
"simultaneously" in the sense that each operates on a single data
value before the next is processed, in order to reduce
processing time by having only a single download of the
data. For example, the range detector 142 may use each
accessed data word as soon as the type detector is finished
with it and then use it in the on-going, cumulative calculations
of minima, maxima, means, and all other range data for
the corresponding field. Once the type detector determines
the data type for each field, the range detector can then
discard range data calculations that are inappropriate to the
detected type. For example, in general there will be no need
for a calculation of the median or mean of Boolean or entire
strings of data (although, as is mentioned above, there may
be for first letters).

Once the type and range of the fields have been
determined, the system then automatically determines various
relationships between the different data fields.
Preferably, several different methods are used, from which
the system initially selects a "best" method in a predetermined
sense, and also orders query devices in such a way on
the display that they indicate to the user which relationships
are the strongest. In some applications, the user knows
which data characteristic (data field) the relationship
determination is to be based on. For example, the user might
wonder which type of purchase (clothing, shoes, or
cosmetics) seems to be most highly dependent on the age of
the customers. In such cases, the user will indicate this to the
processing system by entering the independent variable (in
the assumed example, "age") via the input system 130.
(The processing system may, for example, display a list of
the field names on the display, from which the user can select
in any standard way.) The preferred embodiment of the
invention is not restricted, however, to beforehand knowledge
of which data field is to be the independent variable in
order to determine the structure of the data set, although this
will in general reduce processing time and memory
requirements.

One method of determining the structure of the data set is
statistical correlation, either directly, using standard
formulas, or in conjunction with determining the regression
(especially linear) parameters for any two selected fields of
data. For each possible pair of different data fields, the
system calculates the statistical correlation and stores the
resulting correlation values in a correlation matrix in the
memory. The system then identifies the maximum correlation
value for each field taken in turn as the independent
variable, and orders the remaining variables in order of
decreasing correlation. Note that statistical correlation will
in general be a meaningful measure only of the relationship
between sets of quantitatively ordered data such as numerical
field data.

Moreover, if the user indicates which of the m variables
(that is, which of the m fields) is to be used as the independent
variable before the system begins correlation
calculations, then the system need only calculate and order
the resulting (m-1) correlation values. If the user does not
indicate the independent variable for this or any other
structure-determining method, then the system may simply
assume each variable to be independent in turn and then
calculate correlations with all others; the greatest correlation
found can then be presented initially to the user.

Another method for determining structure is the decision
tree, which can be constructed using known methods. See,
for example, Data warehousing: strategies, technologies
and techniques, Ron Mattison, McGraw Hill, 1996. As an
example, consider FIG. 4, and assume that the independent
variable of interest to the user is F5, that is, total spending.
In the illustrated example, the structure sub-system determines
that 30% of the data records are for men and 70%
correspond to women. Note that this data will preferably
already be available in the range data table under entries for
number of occurrences of each state of each field. Note also
that variable values may be defined as intervals, not only as
individual values; thus, solely for the sake of simplicity of
explanation, in the illustrated example, frequency data (field
F10) is given as one of three intervals: 0-5, 6-10, and more
than ten purchases per time period, both for males and for
females. For each frequency and for each sex, the data is
further branched into age intervals: under 20, 21-35, 36-55,
and over 55. (The decision tree will normally continue to
branch further in order to include the possibilities for the
other fields, but these have been deleted in order to simplify
the discussion, with no loss of generality.) The total average
spending for each branch is indicated at the tip of the
lowermost branch (the independent variable). For example,
the total average spending of the group of 36-55 year-old
men who purchase 6-10 times per time period is $704.
Given the illustrated tree's ordering of branches, one can
see that the highest level of average total spending is for
36-55 year-old women who purchase 6-10 times per period
($2630). By summing "upward" all branches at each level,
the system can determine the total average purchases of all
men/women whose frequency is 6-10, then for all men/women,
and then by traversing the tree "downward" the
system can pick the path (order of variables) that gives the
greatest total average spending, the next greatest, and so on.
Note that the decision tree structure is not limited to
numerically ordered fields.

In another, straightforward structural description of the
data base, the system compiles and inspects the distribution
of distinct values (or number of values in a plurality of
distinct intervals). This may be done either independently or
in conjunction with the construction of, for example, a
decision tree.

Yet another way to determine the structure of the data base
is by using a neural network. The theory and construction of
neural networks is well documented and understood and is
therefore not discussed in detail here. Of note, however, is
that neural networks must in general be "trained" to stabilize
on known data before they can be used on "actual" data.
In the context of this invention, the use of a neural
network will therefore normally require beforehand
knowledge of at least the type of data in the data base, so that
a set of data can be compiled and used to
train the network, namely data of the same
general type and distribution, and with the same general data
relationships, as those actually presented to the processing
system. In many cases this will not be possible;
in others, however, it often is, for example, where the data
is numerical and dependent on an underlying set of
substantially constant rules or natural laws, such as
meteorological data.

Assuming some other method is first applied to determine
membership functions for the different variables (data
fields), fuzzy logic techniques may also be used to measure
the strength of relationships among pairs or groups of
variables.

Other structure-determining methods include predictive
rule-based techniques, which are described, along with still
other methods, in Data warehousing: strategies, technologies
and techniques, Ron Mattison, McGraw Hill, 1996.

Each different method for determining the structure of the
data corresponds to a particular measure of what is meant by
the "closeness" or "strength" of the relationship between
two or more data fields. In many cases, only one of the
different structure-determining methods in the sub-system
144 will be suitable for the detected data types. For example,
statistical correlation may be the most suitable method if all
of the data fields correspond to numerical data, whereas
decision trees will normally be more efficient for ordering
fields of strings or Boolean data. "Suitable" and "efficient"
may be defined and calculated in any predetermined, known
sense to determine a validity value indicating the validity of
the corresponding measure. Furthermore, in many cases the
methods themselves will reveal their own unsuitability. For
example, if almost all data field pairs have statistical
correlation near zero, then a different method, such as a decision
tree, is almost certainly indicated.

Common to all the structure-determining techniques
applied by the structure sub-system is that the sub-system
determines a measure of relevance for each data field. In
some cases, the relevance measure for a given field may be
wholly independent of other fields. For example, one
straightforward measure of relevance might be a count of
how many fields have a certain value, or how many distinct
values the field holds. This might be very relevant, for
example, in evaluating the sales of some particular product, 
regardless of other sales. 
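
The field-local counts just described can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `field_relevance`, the dictionary-based record layout, and the sample records are all assumptions made for the example.

```python
from collections import Counter


def field_relevance(records, field, value=None):
    """Return (distinct_count, match_count) for one field.

    distinct_count: how many distinct values the field holds.
    match_count: how many records hold `value` in that field
                 (None if no value was requested).
    Hypothetical helper; the record layout is assumed.
    """
    counts = Counter(rec[field] for rec in records)
    distinct_count = len(counts)
    match_count = counts[value] if value is not None else None
    return distinct_count, match_count


# Made-up sample records, loosely following the sales example.
records = [
    {"sex": "F", "item": "shoes"},
    {"sex": "M", "item": "shoes"},
    {"sex": "F", "item": "clothing"},
]
distinct_items, shoe_sales = field_relevance(records, "item", "shoes")
```

Because neither count looks at any other field, this kind of measure can be computed in the same pass that accumulates the range data.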

In other cases, the measure of relevance may be a measure
of dependence of some set of dependent, secondary variables
(fields) on some base, independent variable (field)
selected either automatically or by user input. One method
for automatic selection would be to use as the independent
field the same field ultimately selected by the user during a
previous evaluation of the same or a similar data base or
through user input. Another automatic method would be for
the system to be connected to an existing expert system,
which then selects the independent field. Yet another automatic
method would be for the system to determine all
possible pairs (or some pre-determined or heuristically
determined number of pairs) of fields, then evaluate the
relevance measure for each pair, and then order all the
results for user evaluation and selection. Statistical correlation
(alone, or in conjunction with a linear regression or
other curve-fitting routine) is one example of a measure of
relevance that is based on a measure of dependence.
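
As a minimal sketch of the all-pairs approach, the following computes the Pearson correlation for every pair of numerical fields and orders the pairs by strength. The names `pearson` and `rank_pairs`, the column-oriented data layout, the use of the absolute value of the correlation as the ordering key, and the sample values are assumptions for illustration; the text above does not prescribe them.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant field carries no linear relationship
    return cov / sqrt(var_x * var_y)


def rank_pairs(columns):
    """Return (strength, field_a, field_b) tuples, strongest pair first.

    Assumption: strength is the absolute correlation, so strong
    negative relationships rank as high as strong positive ones.
    """
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, reverse=True)


# Made-up numerical fields.
columns = {
    "age": [20, 30, 40, 50],
    "total": [100, 200, 300, 400],
    "frequency": [4, 1, 3, 2],
}
ranking = rank_pairs(columns)
```

If the user has already named the independent variable, only the pairs that include that field need to be scored.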

Once the system has calculated the relevance measure for 
each of the fields, then it preferably presents the results to the 
user by displaying the corresponding field names (or some 
other identifier) in order (for example, decreasing) of their
relevance measures. Where the relevance measure involves
dependence of secondary fields on a chosen base field, then
the system preferably displays an indication of which field
is the base field and in what order the other fields depend on
it. The dependent variables are, for example, ordered in
terms of decreasing dependence so that the user is given
guidance as to which relationships may be of greatest
interest. (As is described below, the user can change the
order of presentation and the plotted, relationship-visualizing
display.)

At any time after the system has determined the type and
ranges of the various data fields, the system proceeds with
query device selection. Consider once again FIG. 2. In the
preferred embodiment of the invention, the initially presented
query device will depend primarily on how many
different possible values a data field can assume. The
thresholds for selecting the different query devices will be
predetermined and pre-programmed into the system, but can
be changed under user control after initial presentation (for
example, by activating an icon of the desired query device
and then "dragging" it to the currently displayed query
device, by activating and selecting from a pull-down menu
adjacent to the currently displayed device, or by using any
other known technique for changing portions of a graphical
user interface). For example, if the data type is Boolean, a
toggle may be predetermined to be the initial query device
selected. For string and/or numerical data with fewer than,
for example, seven different values, a checkbox or pull-down
menu may be the default query device. For fields
(variables) with more than some predetermined threshold
number of different values, however, the default query
device may be a single or range slider, depending on the data
type.
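
The default-selection rule can be sketched as a small function. This is a sketch under stated assumptions: the function name, the type labels, the single threshold of seven, and in particular the mapping of string fields to a single slider and numerical fields to a range slider are illustrative choices, since the text only says the slider kind depends on the data type.

```python
def default_query_device(data_type, distinct_values, few_values_threshold=7):
    """Pick an initial query device for one field.

    data_type: "boolean", "string" or "numeric" (assumed labels).
    distinct_values: number of different values detected in the field.
    few_values_threshold: assumed cut-off below which a checkbox is used.
    """
    if data_type == "boolean":
        return "toggle"
    if distinct_values < few_values_threshold:
        return "checkbox"
    # Assumption: strings get a single (alpha) slider, numbers a range slider.
    if data_type == "string":
        return "single slider"
    return "range slider"


device_for_sex = default_query_device("boolean", 2)
device_for_age = default_query_device("numeric", 40)
```

The returned device is only the initial choice; the user can still replace it through the interface.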

More detailed discussion of query device selection is
disclosed in the inventors' article (also mentioned above)
"Exploring Terra Incognita in the Design Space of Query
Devices," C. Ahlberg & S. Truvé, Dept. of Computer Science
and SSKKII, Chalmers University of Technology, Göteborg,
Sweden, which also discusses the scaling of sliders as a
function of their range. For example, the lower limit of the
data values may be placed at the left end of the slider scale,
the upper limit at the right end, the different gradations or



value marks (AB — CD... 

E.K L MNPR S W) can be displayed 

adjacent to the slider, and centered on the previously
calculated median or average value.
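
The placement of the lower limit at the left end and the upper limit at the right end amounts to a linear mapping from data value to position on the scale. The sketch below shows only that mapping; the function name, the pixel-based width, and the clamping of out-of-range values are assumptions, and the centering on the median or average mentioned above is not modeled.

```python
def slider_position(value, lower, upper, width_px):
    """Map a data value to a horizontal offset on a slider scale.

    lower maps to 0 (left end) and upper maps to width_px (right end).
    Values outside [lower, upper] are clamped (an assumed behavior).
    """
    if upper == lower:
        return 0.0  # degenerate range: every value sits at the left end
    clamped = min(max(value, lower), upper)
    return (clamped - lower) * width_px / (upper - lower)


# Hypothetical 200-pixel scale covering the values 0 to 100.
offset = slider_position(73, 0, 100, 200)
```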

The chosen query devices for the different data field 
variables are then preferably displayed on the display in the 
order of dependence on the chosen independent variable. 
The ordering used is, preferably, at least initially, as deter- 
mined by the structure-detecting method that calculated the 
most significant dependence relationship in any pre-defined 
sense, that is, has the greatest validity value. For example, an 
indication of the name of the independent variable (that is, 
its field name) may be displayed in some prominent position 
on the display screen, and the other query devices are then 
preferably positioned top-to-bottom, left-to-right, or in some
other intuitive way so as to indicate decreasing measured 
dependence on the independent variable. 

Once the query devices are sorted and displayed, the 
system preferably also displays an initial plot (for example, 
X-Y, pie chart, bar graph, etc.) of the relationship. The initial 
type of plot, its scaling, color scheme, marker type, size, and
other features — in short, the view selection — are preferably 
selected in any conventional manner.

FIG. 5 illustrates a simplified display screen corresponding
to one possible set of data processed by the invention
using the earlier example of a data base of sales statistics. In
the example, the system has determined that the strongest
relationship, given the independent variable age, is with
purchase frequency, followed by sex, then recency, and so
on, which is indicated to the user by displaying the corresponding
query devices vertically in descending order. In the
example, rangesliders were indicated and automatically
selected for the fields "Frequency" and "Recency", whereas
toggles were chosen for each of "Male" and "Female," since
they can be plotted non-exclusively using different data
markers, for example "A" and "O". The structure detection
method (measure) with the best validity value is displayed in
display region 500 as the decision tree ("TREE"). The
default plot type is shown in region 502 as an X-Y plot. By
activating, for example, conventional pull-down menus such
as 530, 540 and 550, the user may direct the system to
change the query device for any given field, the measure to
be used to determine the order of dependency of the dependent
variables (data fields), the plot type or
color scheme used for plotting, etc.

Using a pull-down menu, the user has selected "AGE" as
the independent variable, and, using a different pull-down
menu 506, indicated to the system that TOTAL PURCHASES
should be plotted against AGE. Rangesliders 508,
510 are preferably displayed on the respective x- and y-axes
to allow the user to adjust (by moving the range markers
with the cursor 220) the displayed ranges. In the illustrated
example, the system plots only the data for which the
frequency lies in the range 2-25 and the recency lies in the
range 0-24, since the user has moved the range markers of the
respective range sliders accordingly.
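
Restricting the plot to the records that satisfy the current slider and toggle settings can be sketched as a simple filter. The function name `visible_points`, the dictionary-based settings, the inclusive range bounds, and the sample records are assumptions for illustration only.

```python
def visible_points(records, ranges, toggles):
    """Return the records that pass every range slider and toggle.

    ranges:  {field: (low, high)} inclusive bounds from range sliders.
    toggles: {field: set of values whose toggles are switched on}.
    Both structures are hypothetical stand-ins for the device states.
    """
    shown = []
    for rec in records:
        in_ranges = all(lo <= rec[f] <= hi for f, (lo, hi) in ranges.items())
        toggled_on = all(rec[f] in allowed for f, allowed in toggles.items())
        if in_ranges and toggled_on:
            shown.append(rec)
    return shown


# Made-up records; the slider settings follow the example in the text.
records = [
    {"age": 40, "total": 704, "sex": "M", "frequency": 8, "recency": 3},
    {"age": 45, "total": 2630, "sex": "F", "frequency": 9, "recency": 30},
    {"age": 19, "total": 120, "sex": "F", "frequency": 1, "recency": 2},
]
settings = {"frequency": (2, 25), "recency": (0, 24)}
plotted = visible_points(records, settings, {"sex": {"M", "F"}})
```

Re-running the filter whenever a device state changes gives the interactive re-plotting behavior without re-submitting a query.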

Using known techniques, the system continually senses 
the state of all toggles, range and alphasliders, etc. and 
whenever a change is detected, it re-plots the selected 
relationship to include only the desired data characteristics. 
For example, if the user were to "click" on the toggle for 
"Male" ("M"), so that it is de-selected, then the system 
would remove the "A-marked" data points on the plot 520. 
As the user changes the settings of other query devices, the 
system updates the display accordingly to include only the 
field data that falls within the indicated ranges. This allows 






US 6,405,195 B1


the user to view and change the data base presentation
interactively, so that there is no break in concentration and
exploration of the data base for time-consuming
re-submission of conventional queries.

Administrative information such as data set file name, the
number of records, and the date and time may also be
included on the display screen as desired and as space
permits.

Remote Processing Embodiment

According to a further embodiment of the invention, the
input unit 130 and the display 120 are located at a user's site
that is remote from the processing system 110 itself. The
data base (or several data bases, depending on the
application) itself is located either at the user's site, or at one
or more third-party sites that are accessible to the processing
system. In other words, the data to be analyzed is located
locally (that is, at the user's site or at a third-party site
designated by the user, or both), as are the devices needed to
submit queries and view the results, but the actual processing
is carried out remotely, and is accessed via a network.
This allows a user who does not locally have the data
analysis capability provided by the processing system
according to the invention to still have the benefit of it. The
general configuration of this embodiment of the invention is
illustrated in FIG. 6.

An enterprise 600 (any number of which may be included
in the invention) is a local system, that is, located at a user's
site, that includes at least one (and often many) local
processor or processing system 610, typically a network
server, one or more data bases DB1, ..., DBn, conventional
browser software whose results can be viewed on a conventional
display 620, and a conventional input device 630. The
display 620, which is preferably controlled by a browser or
similar software, and input device 630 in this embodiment
correspond to and may in fact replace the display and input
devices 120, 130 shown in FIG. 1.

In the generalized embodiments of the invention
described above (see FIG. 1), the processing system 110,
which forms the data analysis system, is connected to one or
more data bases 100 via the channel 114. In this remote
processing embodiment of the invention, the channel is a
conventional data network 614. The network may be internal
to the enterprise, such as a standard local area or proprietary
network, for example, connecting many different sites of a
large corporation. This embodiment of the invention is,
however, most useful when the network 614 is a wide-area,
publicly accessible network such as the Internet, since it then
allows not only for the widest range of users to take
advantage of the data categorization, analysis, and visualization
capabilities of the processing system according to the
invention, but also makes it possible for a user to access,
analyze, and visualize data in any data base accessible
through the network, that is, even in third-party data bases,
as long as they are accessible via the network.

In many corporate enterprise systems, a so-called "firewall"
640 is implemented to isolate the hardware and
software components of the system from the public network
614 (such as the Internet) in order to protect the system
against corruption (such as from viruses) and intrusion (such
as from "hackers") from outside sources also connected to
the network. It is usually desirable to allow at least some
contact with other entities via the network 614. One way to
ensure this without loss of security is to include a well-controlled
and monitored portal 645 through the firewall.
This connection may be made even more secure by implementing
any standard or agreed-upon encryption scheme
(for example, any of the widely used "public key infrastructure"
(PKI) schemes) for data transfer to and from the enterprise
600. Other systems communicating with the enterprise
600 will then include similar portals and transfer data using
the same encryption standard. The data communication will
then be secure and private, even though it is taking place
over a (preferably) public network 614. This arrangement is
known widely as a "virtual private network" (VPN). This
invention also works well with multiple, collaborating sites
or entities, any or all of which may exist behind respective
firewalls and communicate via the public network, usually
via a secure VPN arrangement. Such cooperating, collaborating
entities are widely known as an extended or virtual
enterprise.

Other enterprises, such as most individuals and small
organizations, do not have a firewall, but rather allow direct
connections between the individual computers within the
organization and the network 614, or between an internal,
local network and the public network 614, or both. This
remote processing embodiment of the invention will work with
either arrangement, and the term "enterprise" is used here to
denote any such user, whether an individual computer
system, or an entity with or without an internal network of
several connected computers communicating either independently
with the external network 614 or only through a
common server, and either with or without firewall protection
and with or without a hardware and/or software component
providing VPN capability, including extended or
virtual enterprises.

The only requirement is that at least one local processing
system 610 should be able to connect to the network 614,
either directly (using any known technology such as dial-up
connections, DSL, satellite, etc.) or indirectly, via one or
more intermediary servers, such as when using a corporate
server and/or a third-party Internet Service Provider (ISP),
and should allow data transfer (downloading) via the
network.

There are many known techniques by which processors
transfer data over a network such as the Internet and either
allow access to data bases or transfer the content of such
data bases via the network to other systems such as the
processing system 110. In the Internet context, several
protocols, such as the File Transfer Protocol (FTP), are
standard and well known. Similarly, the techniques used for
transferring display-control data, and for inputting and
uploading parameters via a text-based or graphical user
interface, are very well known to any user of, for example,
the Internet. For example, HTML (hyper-text markup
language), XML (extensible markup language), and Java are
commonly used in transfers in order to generate displays,
photos, text, and so on, on a computer display connected to
the Internet. Any such conventional transfer protocols and
languages may be used to implement the various data
transfers and display generation in this invention.

In this remote processing embodiment of the invention,
the processing system 110 (FIGS. 1 and 3) is hosted
remotely, that is, separate from the enterprise 600. In this
context, "separate" means that the enterprise 600 is connected
to the data analysis system 110 only via the network
614.

As is indicated in FIG. 6, the analysis (data processing)
system 110 in this remote processing embodiment of the
invention does not require a separate input and display
system as shown in FIG. 1, but these devices may of course
also be included as needed. The analysis system may itself
be provided with a conventional VPN module 655 that
corresponds to the portal 645 of the enterprise system.
Communication over the network (indicated by the dashed
portion of the line connecting VPN 645 and VPN 655) can
thus be made secure, using any conventional encryption 
routine for data transfer between the enterprise system 600 
(in particular, the local processor 610) and the remote, 
hosted data analysis system 110. 

In FIG. 6, the module labeled decision support 660 is used
to indicate, collectively, the various modules for data typing 
140, range determination 142, structural analysis 144, inter- 
face selection 146 and, if needed, sampling 148 as shown in 
FIGS. 1 and 3, as well as the memory 112. In effect, in this 
remote processing embodiment of the invention, the main 
processing system 110 remains at a host site, but the data 
base(s) 100, user input 130 and display 120 functions are 
localized to the user's site, the connection between the 
processing system and the data to be processed being the 
network 614. 

It is not necessary for all the data designated for analysis, 
classification, visualization and display to be located within 
the enterprise 600. Rather, the invention may also be used to 
process data in one or more publicly available data bases, 
one of which is indicated in FIG. 6 as the data base DBF. 
Note that, in this case, data transfer will in general not be 
secure, as is illustrated by the direct connection between the 
decision support module 660 and the network 614. Such data 
transfer can be carried out using any of the well-known, 
widely used techniques now available for network file and 
data transfer. 

Any known mechanism may be used to identify the 
various systems connected to the network 614. Common 
identifiers for Internet sites include Universal Resource 
Locators (URL's) and Data Source Names (DSN's). The
methods by which one network entity (such as the enterprise
600) addresses another (such as the data analysis system
110) and requests and transfers data are well known. Any such
method may be used according to the invention. 

Assume that a user at the enterprise 600 wants to analyze
and visualize data in either internal data bases DB1-DBn, or
an external data base DBF that is addressable via the
network, or both. The user first submits a request to the
network (using conventional methods), for example, using
the browser 620 or using some other known software
directly through the local processor 610, for access to the
data analysis system 110. This may be done, for example, by
submitting the URL of the system 110. Once the connection
has been established, the data analysis system 110 transmits
to the user's browser 620 a standard analysis request screen.
The user then identifies for the analysis system 110 the
network address(es) or similar parameters enabling the data
analysis system 110 to address and access the data bases
whose data the user wishes to have visualized. The data
analysis system 110 then carries out the procedural steps of
data analysis described above: 1) accessing the user-
identified data base(s) and exchanging standard protocol
information (including encryption parameters to establish a
VPN, if implemented); 2) downloading records and classi-
fying them by type; 3) determining the range of the data for
each field in the set; 4) analyzing the relational structure of
the data records; 5) determining a user interface to be
displayed on the user's browser 620; and, if needed, 6)
sampling the data.
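
The typing, range-determination and interface-selection steps listed above can be pictured with a minimal sketch. It assumes records arrive as dictionaries and that numeric fields map to range sliders while all other fields map to check boxes; that mapping rule, and the names `QueryDevice`, `classify_field` and `select_query_devices`, are illustrative assumptions, not the selection logic of the analysis system 110.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class QueryDevice:
    """Hypothetical description of one on-screen query device."""
    field: str
    kind: str      # "range_slider" or "check_boxes"
    options: Any   # (min, max) for a slider, sorted distinct values for check boxes


def classify_field(values: List[Any]) -> str:
    """Step 2 (assumed rule): a field is numeric only if every value is a number."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"


def select_query_devices(records: List[Dict[str, Any]]) -> List[QueryDevice]:
    """Steps 2, 3 and 5: type each field, find its range, pick a device."""
    if not records:
        return []
    devices = []
    for name in records[0]:
        values = [rec[name] for rec in records]
        if classify_field(values) == "numeric":
            # Step 3: the slider spans the observed range of the field.
            devices.append(QueryDevice(name, "range_slider", (min(values), max(values))))
        else:
            devices.append(QueryDevice(name, "check_boxes", sorted(set(values))))
    return devices


records = [
    {"P1": 2.5, "F": "a"},
    {"P1": 7.0, "F": "b"},
    {"P1": 4.0, "F": "a"},
]
devices = select_query_devices(records)
```

Structural analysis (step 4) and sampling (step 6) are left out of the sketch because their details are given elsewhere in the description.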

The user may then adjust the interface query devices as 
displayed on the browser 620 (or analogous monitor display) 
and submit the adjustments via the browser 620 using any 
conventional techniques. In FIG. 6, a range slider for a
parameter P1 is shown, as are two check boxes for factors
F1 and F2 and a hypothetical two-curve display of the
results. This is of course just one very simplified example of
a possible display. FIG. 5 is another. The data analysis
system 110 then in turn regenerates the display in accor-
dance with the adjusted input query devices (which form
display parameters).
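
As a rough illustration of regenerating a display from adjusted query devices, the sketch below treats a slider as a closed numeric range and a group of check boxes as a set of allowed values, and returns the records that would remain visible. The function name `apply_queries` and the record layout are assumptions made for this example only.

```python
from typing import Any, Dict, List, Set, Tuple

Record = Dict[str, Any]


def apply_queries(
    records: List[Record],
    ranges: Dict[str, Tuple[float, float]],
    selections: Dict[str, Set[Any]],
) -> List[Record]:
    """Return the records that satisfy every slider range and check-box selection."""
    visible = []
    for rec in records:
        in_range = all(lo <= rec[f] <= hi for f, (lo, hi) in ranges.items())
        selected = all(rec[f] in allowed for f, allowed in selections.items())
        if in_range and selected:
            visible.append(rec)
    return visible


records = [
    {"P1": 1.0, "factor": "F1"},
    {"P1": 5.0, "factor": "F2"},
    {"P1": 9.0, "factor": "F1"},
]

# Initial state: slider for P1 set to 0..6, both check boxes ticked.
shown = apply_queries(records, {"P1": (0.0, 6.0)}, {"factor": {"F1", "F2"}})

# The user unticks F2; the host recomputes what is displayed.
shown_after = apply_queries(records, {"P1": (0.0, 6.0)}, {"factor": {"F1"}})
```

Only the filtering is shown; how the filtered records are drawn (for example, as the two curves of FIG. 6) is outside the scope of the sketch.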

The data analysis system 110 preferably also includes a
repository for various structure-determining algorithms, such
as linear regression, curve-fitting, neural networks, etc. The
user may then also select which algorithm he prefers for a
given data analysis rather than accept the structure algorithm
that the analysis system would select automatically as
described above. This feature may be needed in certain
applications, such as clinical drug trials, in which the
measure of relevance is prescribed.

Remote Hosting Embodiment

FIG. 7 illustrates the general configuration of an embodi-
ment of the invention in which a host system 700 is
connected via the network 614 to one or more user systems
702, 704, 706, which are also labeled USER1, USER2, . . . ,
USERn for more ready reference. Merely by way of
example, in FIG. 7, features that are identical to those
included in previously illustrated embodiments and aspects
of the invention have retained the same reference numbers.

In this embodiment, each user system USER1, . . . ,
USERn, as in earlier embodiments, is assumed to include a
standard display capability, some standard input device and,
in most cases, both. These are not shown in FIG. 7 merely
for the sake of simplicity. As FIG. 7 illustrates, the user
systems may, but need not, include or at least be able to
access one or more local data bases, for example, DBU1 at
the USER1 site, DBU2 at the USER2 site, and so on. Note
that any user site may contain any number of data bases,
including none at all. Single data bases are shown in FIG. 7
for USER1 and USER2 merely for the purpose of illustra-
tion.

The system of each user who wishes to be able to view the
results of a data analysis will also include some conventional
display software; those who also wish to be able to adjust query
devices and thus interactively explore relationships in the
data will also need conventional software that enables
on-screen manipulation of the query devices. A conventional
browser (such as Internet Explorer or Netscape Navigator)
is, as is well known, software suitable for both these
tasks (data presentation and query device adjustment) and
is therefore preferably included in each user system for each
user participating in the collaborative data analysis accord-
ing to this embodiment of the invention. In FIG. 7, a browser
is shown as being included in the USER1 system, but may
be assumed to be in other user systems as needed.

In this remote-hosting embodiment of the invention, the
host system includes, as in other embodiments, a processor
610 or system of processors, which is connected (via a
conventional I/O device such as a modem, ISDN interface,
etc., not shown) to the network 614 either through a
firewall, or directly, or both, depending on the configuration
preferred in any given implementation. As is explained
further below, the host system should be able to receive and
store data for analysis, to analyze the data, and to communicate
the results of the analysis for preferably visual presentation
to one or more users. As such, in this embodiment of the
invention, the host system serves as a network portal through
which users may collaboratively explore data that they
themselves have generated through other conventional
means, with no need to individually acquire and run data
analysis software. Improvements to the data analysis routine
can thus be made readily available to all users who partici-
pate in the system, simply by updating the software in the
host system.

In the preferred embodiment, in which the automatic data
analysis and query generation routine described above is
implemented, the host system also needs to be able to sense
changes by a user to displayed query devices, and to then
transmit the correspondingly updated visualization of the
data to the user. The host system will therefore in most cases
be implemented using a conventional network server, which
typically has all necessary hardware and can be programmed
using known software techniques to accomplish all these
tasks. Another advantage of using a server is that it will
typically have enough memory for a large user space, and can
handle more than one network access at a time, thus allowing
several different, unrelated data analyses, that is, data
exploration or "data mining" operations, to take place at the
same time. Even common standard computer systems, such as
"personal computers" will, however, in most cases have
sufficient memory and include a modem or other
network-connection hardware to handle at least one user who
is accessing the data analysis module according to the
invention.

As in other computer systems, the host system 700
preferably includes system software 720, such as an operating
system, various device drivers, etc., all of which are
well-known in the art of computer science. As is also
well-known, such system software coordinates, in any known
manner, the transfer of information between the host system
700 and the network 614, and it also allocates and administers
the memory within the host used by applications such as the
data analysis module 110 according to the invention, and an
e-mail routine 730 (if included) such as Microsoft Outlook
Express.

In this remote-hosting embodiment of the invention, a
user space 740 is allocated within the memory, either the
system memory, if enough is available, or, preferably,
non-volatile storage such as disk memory. This user memory
space is then preferably partitioned or allocated into regions
in such a way that each user who is participating in the
remotely hosted data analysis system according to this
embodiment of the invention is allocated a memory region.
Each user may then transfer, that is, upload, via the network,
either data to be used in an analysis, or network links (such
as URL's) to such data, or both. By way of example, in FIG.
7, a memory region has been allocated to each of the users
USER1, USER2, . . . , USERn, each of whom is assumed to
be a participant in the system. USERx, however, is assumed
not to be a participating user, and thus has no allocated
memory region in the user space 740.

Methods for accessing a server and for uploading data, for
example, using the File Transfer Protocol, into a dedicated
memory space of the server are known. For example,
web-hosting services use this known uploading procedure to
allow users to store the HTML (for example) code for their
web sites, as well as data (including images, executable
code, etc.) that may be downloaded by those accessing the
respective web site.

In general, in this embodiment of the invention, users
upload into their respective user spaces the data and links to
data that is to be used in the data analysis. The decision
support system 110 then accesses and analyzes the data as
before, preferably including automatically generating initial
query devices, and makes the results available to one or
more of the participating users. These results are then
transferred in a conventional manner over the network for
display on one or more user systems, preferably on their
respective browsers. A user may then alter one or more
queries as described above, for example by adjusting the
respective displayed query devices, and submit the adjusted
queries back over the network to the host system, whereupon
the decision support system 110 regenerates the display
according to the updated queries.

FIG. 7 illustrates just a few of the many possible
combinations of data that can be presented for analysis to the
decision support system 110. In this example, USER1 and
USER2 have uploaded the data in data bases DB1 and DB2,
respectively, into their respective user memory regions
USER1 and USER2. Each user has also uploaded a link:
USER1 has uploaded a link (LINK DB2) to the DB2 data
stored in the USER2 region, and USER2 has uploaded a link
(LINK DBP) to an external data base DBP, which is
assumed to be accessible via the network, for example, from
a network server. The data that USER1 wishes to be included
in the automatic analysis is thus the data in DBU1 and
DBU2, and the data that USER2 wishes to be analyzed is the
data in DBU2 and DBP.

According to this embodiment of the invention, one or
more users (user systems) may thus be the source of data for
any given analysis, and the results may be made available to
any user who is participating in the system, or may be sent
by any participating user, for example by e-mail attachment,
even to other non-participants. In order to protect the privacy
of data uploaded into any user memory region, or otherwise
stored and referenced in the host system, conventional
password protection may be included to prevent access by
unauthorized users. In order to protect the integrity of any
data file, moreover, conventional methods may also be
employed to restrict it to read-only use.

In some implementations of the invention, each user will
be allocated a certain amount of memory in which to upload
data. This would be similar to the manner in which
web-hosting services assign to each member a certain amount of
memory for that member's web site. However, it is possible
when using this embodiment of the invention for the amount
of data to be analyzed to be very large, possibly exceeding
any preset size limit. This situation may be handled in
several ways. One way is simply for the user system to
request the host system to allocate more memory, at least
temporarily. In most cases, any transfer of a data base also
first involves transferring information concerning the size of
the data base, the number of data sets included, the number
of records in each data set, and the number of fields in each
record. If any data set exceeds some predetermined
threshold, as mentioned above, then the host system could
instead direct the user system to transfer only a subset of the
data according to a sampling procedure, such as those
described above.

Still another way is simply to design the host system to
include enough mass storage to handle all the anticipated
uploaded data. An estimate of the need may be determined
using known methods, and memory may be added as the
number of users or the size of the data sets increases. This
is, once again, a common problem facing web-hosting
systems: as the need grows, more servers are added, as well
as more memory capacity.

It is not necessary to store the actual uploaded data sets in
the memory regions of the respective users. Rather, all data
may be stored in the conventional manner in the memory of
the host system; the user memory regions will then contain
the address information to the regions of memory where the
data is stored in the host system, or the network address of
externally located data, such as DBP. Such file allocation
systems are well known in the art of computer science. Note
that one simple addressing scheme would be to assign each
stored data set a network address (such as a URL), once
again, analogous to the manner in which web sites are
structured and addressed. Addresses pointing within the host
system could then be translated to standard file location
information using a simple allocation and conversion table.
This would enable uniform addressing of all data sets to be
used in an analysis, not only by the host system, but also by
other users who might want to access the data for analyses
of their own. USER1 could thus easily make his data (for
example, DBU1) available to USER2, for example, simply
by telling USER2 what the network address (the link) is to
DBU1. It would then not be necessary to upload the data into
the host system again, since it would already be stored there.
Even users, for example, USERx, who are not participating
members in the data analysis system could then also access
the data, given the network address and suitable authorization
codes.

When any user has uploaded into his user region either all
the data to be used in an analysis, or the links (addresses) to
the data, or both, he may use any known method to indicate
to the host system that data analysis should begin. For
example, the user could "click" on a suitable icon on a
display generated by his browser upon network connection
to the host system; the analysis/decision support system 110
will then proceed to classify and analyze the data included
and/or indicated in the respective user memory region.

In most cases it will be preferable to upload all the data
to be analyzed into the memory of the host system, since
data access during the analysis will then be much faster than
the network transfer speed. This also makes the data readily
accessible to other participating users. Thus, whenever the
decision support system 110 of the host system is pointed by
way of a link (for example, a network address) to a data base
whose data has not already been retrieved into the host
system memory (either into one of the user regions or into
some other temporarily or permanently allocated storage
region), then the host system preferably accesses the data
base corresponding to the link and retrieves the data. This
may be done, for example, when the user indicates that data
analysis should begin, or in accordance with some other
preparatory command for the host system to retrieve the data
sets to be used in the later analysis.

Once the analysis/decision support system has completed
its initial analysis of the data as described above, it will then
have selected initial query devices, and will have the display
data corresponding to the initial analysis, given the selected
relevance measures, etc. This information, represented in
FIG. 7 as QUERIES1 and RESULT1 for USER1 and
QUERIES2 and RESULT2 for USER2, can then be transferred
for display and adjustment to the respective user's systems
702, 704. This information could also be transferred for
display to more than one user, so that these users may
independently or collaboratively adjust the queries and see
how the adjustment affects the displayed visualization of the
relationships between the various data sets included in the
analysis. Note that once an analysis has been completed,
there is no need to redo it just because a query has been
changed; rather, the results will be available for visualization
even later, and as long as the structure (relevance measures)
determined by the decision support module is not changed
by the user, then the user can continue with his analysis by
viewing the previously stored results (for example,
RESULTS1) and adjusting the queries as normal.

According to another feature that may be included in this
embodiment of the invention, the host system may also
maintain a log of the actions taken by users with respect to
any data analysis. In FIG. 7, a log is shown in each user
memory, that is, LOG1 for USER1, LOG2 for USER2, etc.
The log may contain, for example, the history of which
user(s), which data, and which query states were included in
a particular analysis, visualization or report, possibly with a
copy of those results. In effect, the log could, as its name
implies, be a record of the analysis history for any body of
data. Another advantage of maintaining the logs, besides the
obvious advantage that users can track their work over time,
is that they would also enable audits. For example,
pharmaceutical companies could make their logs available to
regulatory authorities in order to validate the results of clinical
studies.

In addition to, or instead of the logging feature, another
feature that may be included in this embodiment of the
invention is e-mail notification of any access or change of
uploaded data in any user's memory region. Whenever data
in any user's memory region is accessed or used by any user
besides the one who originally uploaded the data into the
host, then the host system would transmit a message to this
effect as e-mail, via the network, to the original user,
preferably with information identifying the user accessing
the data. Such accesses could then also be logged in the
user's memory region. This would not only increase the data
security of the system, but it would also be useful for
coordinating multiple analyses involving, at least in part, the
same data.

This remote-hosting embodiment of the invention is
particularly advantageous when more than one researcher wants
to explore data, which may have been generated by any or
all of them. It also allows them to link into even other
externally generated data, as long as it is available in a
known format via the network (for example, demographic or
meteorological data often made available by governmental
agencies), and thus explore possible relationships with data
gathered from outside their own research team. Results (in
particular, display data that visualize an especially
interesting relationship) can also be sent, for example, as an
attachment in any conventional format to e-mail generated
and transferred in the conventional manner by the e-mail
module 730, to anyone who is able to access the network and
who has a browser or similar software that is able to receive
and display the data.

The automatic data classification and analysis method
described above is of course the preferred method carried
out by the analysis/decision support system 110, since its
advantages are just as beneficial in this remotely hosted
embodiment of the invention as in any other. Indeed, it is
particularly advantageous in this embodiment, since users
may submit for analysis data from several different sources,
so that it will often be especially difficult for the user to
know ahead of time what postulated relationships (relevance
measures) are most likely to yield interesting and perhaps
even surprising results.

On the other hand, the main feature of this embodiment of
the invention is that one or more users can upload data for
analysis into the host system, which carries out the actual
analysis; the host system thus acts as a network portal and
needs only to be able to access the uploaded user-specified
data to be included in the analysis. As such, this embodiment
of the invention does not presuppose any particular analysis
routine. The analysis method described above, with
automatic selection of relevance measures and/or even query
devices, is preferred because of its flexibility and ability to
so effectively make it possible for users to visualize and even
discover relationships about the data.

Rather, the analysis/decision support system 110 could
implement a particular, pre-determined analysis routine, or a
library of possible analysis and visualization methods (for
example, linear regression, polynomial, trigonometric or
similar data-fitting algorithms, neural networks with
predetermined initial structures, etc.). Along with submitting data
for analysis, the user could then also select which of the
analysis routines should be applied, for example, by
selecting it from a browser-displayed pull-down menu. Of course,
as is mentioned above, this feature may be needed in certain
applications, such as clinical drug trials, in which the
measure of relevance is prescribed.
The analysis/decision support module could, for example, 
also implement conventional text-based (for example, 
keyword-driven) querying and reporting, or multi-
dimensional, hierarchical data analysis and visualization.
The user could, for example, then, after or in conjunction
with uploading data into the host system, upload text queries 
and view results as in conventional systems, except that, 
using this embodiment of the invention, the data being 
analyzed will have been uploaded from one or more users 
into the host portal. 
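
The user memory regions of this embodiment, with their uploaded data, links, access logs and owner notifications, can be pictured with the following minimal in-memory sketch. Everything named here is an assumption for illustration: the classes `UserRegion` and `HostUserSpace`, the `"owner/data set"` link format, and the `outbox` list that stands in for messages a real host would hand to its e-mail routine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class UserRegion:
    """One participating user's region of the user space."""
    owner: str
    data: Dict[str, list] = field(default_factory=dict)        # uploaded data sets
    links: Dict[str, str] = field(default_factory=dict)        # link name -> address
    log: List[Tuple[str, str]] = field(default_factory=list)   # (accessing user, data set)


class HostUserSpace:
    def __init__(self) -> None:
        self.regions: Dict[str, UserRegion] = {}
        self.outbox: List[Tuple[str, str]] = []   # (recipient, message) pairs

    def allocate(self, user: str) -> None:
        self.regions[user] = UserRegion(owner=user)

    def upload(self, user: str, name: str, records: list) -> None:
        self.regions[user].data[name] = records

    def add_link(self, user: str, link_name: str, address: str) -> None:
        self.regions[user].links[link_name] = address

    def read(self, requester: str, owner: str, name: str) -> list:
        region = self.regions[owner]
        records = region.data[name]
        if requester != owner:
            # Log the foreign access and queue a notification for the owner.
            region.log.append((requester, name))
            self.outbox.append((owner, f"{requester} accessed {name}"))
        return records

    def follow_link(self, user: str, link_name: str) -> list:
        # Assumed address format "owner/data set" for data already on the host.
        owner, name = self.regions[user].links[link_name].split("/", 1)
        return self.read(user, owner, name)


host = HostUserSpace()
host.allocate("USER1")
host.allocate("USER2")
host.upload("USER2", "DB2", [{"x": 1}, {"x": 2}])
host.add_link("USER1", "LINK DB2", "USER2/DB2")
shared = host.follow_link("USER1", "LINK DB2")
```

Links to external data bases, password protection and read-only restrictions are omitted to keep the sketch short.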

Because this remote-hosting embodiment of the invention 
is not dependent on any particular analysis routine, it is also 
not limited to any particular data structure. Using the pre- 
ferred analysis routine, with automatic selection of rel- 
evance measures and initial query devices, data in the data 
bases will typically be organized into records, with each 
record including at least one field. The actual data structure 
used to organize the data uploaded into the host system will, 
however, depend on which type of analysis routine is to be 
invoked for the analysis. If the data is to be used for 
conventional text-based searching, then the data structure 
can be a simple one-dimensional list. In general, for any 
n-dimensional analysis, visualization, text-based report, etc., 
the data should typically be able to be classified into at least 
a corresponding number n of different sets that can be
compared using some measure of relevance. 

What is claimed is: 

1. A method for processing data from at least one data
base, in which each data base contains a plurality of records 
and each record includes a plurality of data fields, and each 
field contains field data, has a field name and one of a 
plurality of data types, comprising the following steps: 
receiving into a host system, via a network, the data from 
the at least one data base from at least one participating 
remote user system that is separate from the host 
system, 

in the host system, upon receipt of a request for initiation 
from the remote user system, analyzing the data from 
the at least one data base according to an analysis 
routine and generating analysis results; 

in the host system, generating a representation of the 
analysis results; and

transferring the representation of the analysis results via 
the network for display on at least one participating 
remote user system; 

in a decision support module in the host system, auto- 
matically selecting an initial, adjustable, graphical 
query device as a function of and adapted to a type and 
range of the corresponding field data; 

transferring each graphical query device via the network 
to at least one participating user system; 

sensing, via the network, adjustment by the user of each 
participating user system to which each graphical query 
device has been transferred of any of the displayed, 
adjustable, graphical query devices; and 

in the host system, updating the representation of the
analysis results corresponding to the sensed adjustments
of any of the query devices, thereby enabling
interactive visualization of the analysis results of the 
data via the network. 

2. A method as in claim 1, in which at least one of the user
systems to which graphical query devices are transferred is
one of the participating user systems other than the
participating source user system.

3. A method as in claim 1, further including the step of
allocating, for each participating user system, a corresponding
memory region in the host system, each memory region
storing:

data from the at least one data base transferred via the 
network from the respective participating user system
to the host system; and 

a log of accesses to the data stored in the respective 
memory regions. 

4. A method as in claim 1, further including the step of 
notifying, via the network, each user whose corresponding 
data, stored in the respective memory region, is accessed by 
any other participating user. 

5. A method for processing and visualizing data from at 
least one data base, in which each data base contains a 
plurality of records and each record includes a plurality of 
data fields that include field data, comprising the following 
steps: 

receiving in a host system, via a network, from at least one 
remote participating user system separate from the host 
system, the data from the at least one data base; 

in the host system, upon receipt of a request for initiation
from the remote user system, analyzing the data from
the at least one data base by detecting a relational
structure between the data fields by calculating a
respective relevance measure for each of the data fields,
the relevance measure being a data type-dependent
function indicating a measure of relational closeness
between data in at least one of the data fields of the
plurality of records to data in at least one other of the
data fields of the plurality of records;

in the host system, generating a graphical representation
of the relational structure;

transferring the graphical representation of the relational
structure via the network for display on at least one
participating user system;

for each of the data fields, in a decision support module
in the host system, automatically selecting an initial,
adjustable, graphical query device as a function of and
adapted to the type and range of the corresponding field
data;

transferring each graphical query device via the network 
to at least one participating user system; 

sensing, via the network, adjustment by the user of each 
participating user system to which each graphical query 
device has been transferred of any of the displayed, 
adjustable, graphical query devices; and 

in the host system, updating the graphical representations 
of the relational structures corresponding to the sensed 
adjustments of any of the query devices, thereby 
enabling interactive visualization of the relational 
structures of the data fields via the network. 

6. A method as in claim 5, in which at least one of the user
systems to which graphical query devices are transferred is 
one of the remote participating user systems other than the 
initiating, participating source user system. 

* * * * *






